Incremental Computation of Linear Machine Learning Models in Parallel Database Systems

Abstract

We study the serial and parallel computation of Γ (Gamma), a comprehensive data summarization matrix for linear machine learning models widely used in big data analytics. We prove that computing Gamma reduces to a single matrix multiplication with the data set, and that this multiplication can be evaluated as a sum of vector outer products, which enables incremental and parallel computation, both essential for scalability. By exploiting Gamma, iterative algorithms are restructured into two phases: (1) incremental, parallel summarization of the data set (i.e., distributive and in one scan); (2) iteration in main memory, using the summarization matrix in intermediate matrix computations (i.e., reducing the number of scans). Assuming the machine learning model is based on Gaussian distributions, we show that the covariance (and correlation) matrix, present in every Gaussian model, can be derived directly from Gamma. Therefore, many intermediate computations on large matrices collapse to computations on Gamma, a much smaller matrix. We argue that specialized database algorithms are needed for dense and sparse matrices, respectively, and we introduce a density threshold to choose between them. Assuming a distributed memory model (i.e., shared-nothing) and a larger number of points than processing nodes, we show that computing Gamma exhibits close to linear speedup. We study how to compute Gamma with the processing mechanisms of existing database systems and their impact on time complexity. We also highlight weaknesses and limitations of our proposal.
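The outer-product formulation in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the layout of Gamma, so the code below assumes each point is augmented as z = [1, x_1, ..., x_d, y] and that Gamma is the sum of z zᵀ over all points. The function names (`gamma_update`, `gamma_of`, `gamma_merge`, `covariance_from_gamma`) are hypothetical and chosen only for illustration. This is not the paper's database implementation.

```python
def gamma_update(gamma, x, y):
    """Add one point's outer product z z^T to gamma in place.

    Assumed layout (not stated in the abstract): z = [1, x_1..x_d, y].
    """
    z = [1.0, *x, y]
    for a, za in enumerate(z):
        row = gamma[a]
        for b, zb in enumerate(z):
            row[b] += za * zb
    return gamma


def gamma_of(points, d):
    """Summarize a partition of (x, y) points in a single scan."""
    gamma = [[0.0] * (d + 2) for _ in range(d + 2)]
    for x, y in points:
        gamma_update(gamma, x, y)
    return gamma


def gamma_merge(g1, g2):
    """Combine partial summaries from two partitions (entrywise sum)."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(g1, g2)]


def covariance_from_gamma(gamma, d):
    """Population covariance of x, read off Gamma without rescanning data.

    Under the assumed layout: gamma[0][0] = n, gamma[0][j] = sum of x_j,
    gamma[i][j] = sum of x_i * x_j for 1 <= i, j <= d.
    """
    n = gamma[0][0]
    mean = [gamma[0][j] / n for j in range(1, d + 1)]
    return [
        [gamma[i][j] / n - mean[i - 1] * mean[j - 1] for j in range(1, d + 1)]
        for i in range(1, d + 1)
    ]
```

Because the summary is a plain sum, each node in a shared-nothing system could call `gamma_of` on its own partition and a coordinator could fold the results with `gamma_merge`; the order of merging does not change the result up to floating-point rounding.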

Related resources

Two-stage fuzzy-stochastic programming for parallel machine scheduling problem with machine deterioration and operator learning effect

This paper deals with the determination of machine numbers and production schedules in manufacturing environments. In this line, a two-stage fuzzy stochastic programming model is discussed with fuzzy processing times where both deterioration and learning effects are evaluated simultaneously. The first stage focuses on the type and number of machines in order to minimize the total costs associat...

Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix

We study the serial and parallel computation of Γ (Gamma), a comprehensive data summarization matrix for linear Gaussian models, widely used in big data analytics. Computing Gamma can be reduced to a single matrix multiplication with the data set, where such multiplication can be evaluated as a sum of vector outer products, which enables incremental and parallel computation, essential features ...

Big Data Analytics in Bioinformatics: A Machine Learning Perspective

Bioinformatics research is characterized by voluminous and incremental datasets and complex data analytics methods. The machine learning methods used in bioinformatics are iterative and parallel. These methods can be scaled to handle big data using the distributed and parallel computing technologies. Usually big data tools perform computation in batch-mode and are not optimized for iterative pr...

Fast and Efficient Algorithms for Video Compression and Rate Control

grated to the United States of America in 1975 with his parents, Dzuyet D. Hoang and Tien T. Tran, and two sisters. He now has three sisters and one brother. They have been living in Harvey, Louisiana. a Fulbright Scholar, and an IBM Faculty Development Awardee. He is coauthor of the book Design and Analysis of Coalesced Hashing and is coholder of patents in the areas of external sorting, predi...

Machine learning algorithms in air quality modeling

Modern studies in environmental science and engineering show that deterministic models struggle to capture the relationship between the concentration of atmospheric pollutants and their emission sources. Recent advances in statistical modeling based on machine learning approaches have emerged as a solution to these issues. Input variable type largely affec...

Publication date: 2016